Effect of term distributions on centroid-based text categorization

Authors

  • Verayuth Lertnattee
  • Thanaruk Theeramunkong
Abstract

Most traditional text categorization approaches use term frequency (tf) and inverse document frequency (idf) to represent the importance of words and/or terms when classifying a text document. This paper describes an approach that applies term distributions, in addition to tf and idf, to improve the performance of centroid-based text categorization. Three types of term distributions are introduced: inter-class, intra-class and in-collection distributions. These distributions help increase classification accuracy by exploiting information about (1) term distribution among classes, (2) term distribution within a class and (3) term distribution in the whole collection of training data. In addition, this paper investigates how these term distributions contribute to the weight of each term in documents, e.g., whether a high term distribution of a word promotes or demotes the importance or classification power of that word. To this end, several centroid-based classifiers are constructed with different term weightings. Using various data sets, their performance is investigated and compared to a standard centroid-based classifier (TFIDF) and a centroid-based classifier modified with information gain. They are also compared to two well-known methods: k-NN and naïve Bayes. In addition to a unigram model of document representation, a bigram model is also explored. Finally, the effectiveness of term distributions in improving classification accuracy is explored with regard to the training set size and the number of classes.

V. Lertnattee, T. Theeramunkong / Information Sciences 158 (2004) 89–115. doi:10.1016/j.ins.2003.07.007. © 2003 Elsevier Inc. All rights reserved.
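
As a rough illustration of the standard TFIDF centroid-based baseline that the abstract refers to, the following minimal Python sketch builds one normalized centroid per class and assigns a document to the class whose centroid has the highest cosine similarity. The function names, the `log(N / df)` form of idf, and the toy data are assumptions made for this example. The paper's inter-class, intra-class and in-collection weightings are not reproduced here, because the abstract does not give their formulas.

```python
import math
from collections import Counter, defaultdict


def normalize(vec):
    """Scale a sparse vector (dict term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else {}


def train_centroids(docs, labels):
    """Build one TF-IDF centroid per class (hypothetical helper).

    docs   -- list of token lists
    labels -- class label for each document
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed idf variant: log(N / df).
    idf = {t: math.log(n / df[t]) for t in df}

    sums = defaultdict(lambda: defaultdict(float))
    for doc, label in zip(docs, labels):
        vec = {t: tf * idf[t] for t, tf in Counter(doc).items()}
        # Sum unit-length document vectors within each class.
        for t, w in normalize(vec).items():
            sums[label][t] += w

    centroids = {label: normalize(s) for label, s in sums.items()}
    return centroids, idf


def classify(tokens, centroids, idf):
    """Return the class whose centroid is most similar (cosine) to the document."""
    counts = Counter(t for t in tokens if t in idf)  # ignore unseen terms
    vec = normalize({t: tf * idf[t] for t, tf in counts.items()})

    def score(label):
        centroid = centroids[label]
        return sum(w * centroid.get(t, 0.0) for t, w in vec.items())

    return max(centroids, key=score)


# Toy data, invented for illustration only.
train_docs = [
    ["goal", "match", "team"],
    ["team", "score", "goal"],
    ["stock", "market", "price"],
    ["market", "trade", "stock"],
]
train_labels = ["sports", "sports", "finance", "finance"]

centroids, idf = train_centroids(train_docs, train_labels)
print(classify(["goal", "team", "win"], centroids, idf))
print(classify(["stock", "price"], centroids, idf))
```

In a sketch like this, a term-distribution factor would enter as an extra multiplier or divisor on the `tf * idf[t]` weight; which distributions should promote and which should demote a term is the question the paper investigates.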


Similar resources

Empirical Evaluation of Centroid-based Models for Single-label Text Categorization

Centroid-based models have been used in Text Categorization because, despite their computational simplicity, they show a robust behavior and good performance. In this paper we experimentally evaluate several centroid-based models on single-label text categorization tasks. We also analyze document length normalization and two different term weighting schemes. We show that: (1) Document length nor...


Using Class Frequency for Improving Centroid-based Text Classification

Most previous work on text classification represents the importance of terms by term occurrence frequency (tf) and inverse document frequency (idf). This paper presents ways to apply class frequency in centroid-based text categorization. Three approaches are taken into account. The first one is to explore the effectiveness of inverse class frequency on the popular term weighting, i.e., TFIDF...


Chinese Text Categorization via Bottom-Up Weighted Word Clustering

Most research on text categorization focuses on the bag-of-words representation. Some studies have proposed other methods for classification, such as term phrases, Latent Semantic Indexing, and term clustering. Term clustering is an effective approach to classification and has proved to be a good method for reducing the dimensionality of term vectors. The authors used hierarchical term clustering and ag...


Weight Adjustment Schemes for a Centroid Based Classifier

In recent years we have seen a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and company-wide intra-nets. Automatic text categorization, which is the task of assigning text documents to pre-specified classes (topics or themes) of documents, is an important task that can help both in organizing as well as in finding information on t...


A Feature Weight Adjustment Algorithm for Document Categorization

In recent years we have seen a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and company-wide intra-nets. Automatic text categorization, which is the task of assigning text documents to pre-specified classes (topics or themes) of documents, is an important task that can help both in organizing as well as in finding information on thes...



Journal title:
  • Inf. Sci.

Volume 158

Pages 89–115

Publication date 2004